In this project, we explore a data set on red wine quality. Our main objective is to explore how the chemical properties influence the quality of red wines. This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. The data set is available here and information about the data set is available here.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
X is a unique identifier. X and quality are integers; all other variables are numeric.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
As we are primarily interested in the quality of the red wines, let’s see some basic statistics about it.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
As we can see, all quality scores fall in the range 3 to 8. The most common values are 5 and 6, and the least common are 3, 4, 7, and 8. So we create another variable, rating, that groups quality into three levels: poor, good, and ideal.
## poor good ideal
## 63 1319 217
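The binning rule itself did not survive in this report, so the following is only a minimal sketch under an assumed rule: quality below 5 is poor, 5 to 6 is good, and 7 or above is ideal. The cut points and the variable name `rating_counts` are assumptions. The quality values are the first ten shown in the str() output, and with this rule they give the level codes 2 2 2 2 2 2 2 3 3 2.

```r
# Quality scores of the first ten wines, copied from the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed cut points (not confirmed by the report):
# (0, 4] -> poor, (4, 6] -> good, (6, 10] -> ideal.
rating <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("poor", "good", "ideal"),
  ordered_result = TRUE
)

# Count the wines in each rating level.
rating_counts <- table(rating)
print(rating_counts)
```

Using an ordered factor keeps the levels in the order poor < good < ideal, which is convenient when faceting or colouring plots by rating.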
To calculate the sum of all acids in the red wines, we create a new variable, total.acidity.
## [1] 8.10 8.68 8.60 12.04 8.10 8.06
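The formula is not printed in the report, but the six values above equal fixed.acidity + volatile.acidity + citric.acid for the first six wines, so the sketch below assumes that definition. The data frame name `wines` is hypothetical.

```r
# First six wines, copied from the str() output.
wines <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00)
)

# Assumed definition: total acidity is the sum of the three acid columns.
wines$total.acidity <- with(wines, fixed.acidity + volatile.acidity + citric.acid)

print(head(wines$total.acidity))
```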
- fixed.acidity, volatile.acidity, sulfur.dioxide, sulphates and alcohol appear to be long-tailed.
- density and pH are normally distributed, with a few outliers.
- residual.sugar and chlorides have extreme outliers.
- citric.acid contains a large number of zero values.

Taking log10, we can see that fixed.acidity, volatile.acidity, and sulphates are normally distributed, with a few outliers.
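One rough way to check what the log10 transform does to a long-tailed variable is to compare its skewness before and after. The sketch below does this for the first ten sulphates values from the str() output. The `skewness` helper is a hypothetical moment-based function written for this example, not something taken from the report, and ten values are far too few to say anything about the full data set.

```r
# Moment-based sample skewness (hypothetical helper for illustration).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten sulphates values, copied from the str() output.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80)

skew_raw <- skewness(sulphates)         # skewness on the original scale
skew_log <- skewness(log10(sulphates))  # skewness after the log10 transform

cat("raw:", skew_raw, " log10:", skew_log, "\n")
```

A skewness closer to zero after the transform is in line with the more symmetric, bell-shaped histograms described above.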
citric.acid
## [1] 132
We found that 132 observations have a citric.acid value of zero.
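A count like this can be obtained by summing a logical comparison. The sketch below uses only the first ten citric.acid values from the str() output, so it returns 5 rather than the 132 found on the full data; the names `n_zero` and `share_zero` are hypothetical.

```r
# First ten citric.acid values, copied from the str() output.
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# TRUE counts as 1, so the sum is the number of zero values.
n_zero <- sum(citric.acid == 0)

# The mean of the same logical vector is the share of zero values.
share_zero <- mean(citric.acid == 0)

cat("zero values:", n_zero, " share:", share_zero, "\n")
```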
residual.sugar and chlorides after removing some extreme outliers
- chlorides are now normally distributed.
- residual.sugar comes in a wide range: it is rare to find wines with less than 1 g/L, and wines with more than 45 g/L are considered sweet, so the range of 1 to 4 that we found in the plot is fine, with some outliers.
## 'data.frame': 1599 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ rating : Ord.factor w/ 3 levels "poor"<"good"<..: 2 2 2 2 2 2 2 3 3 2 ...
## $ total.acidity : num 8.1 8.68 8.6 12.04 8.1 ...
Our main objective is to explain quality, so it is the main feature of interest.
density and pH are approximately normally distributed, as is our new variable rating, so these two can help support our analysis.
- rating, which labels each wine as ‘poor’, ‘good’ or ‘ideal’.
- total.acidity, to calculate the sum of all acids.

- residual.sugar and chlorides contain many outliers, but after removing the extreme ones chlorides is approximately normally distributed.
- citric.acid has a very large number of zero values, but according to the documentation this is fine, as it is found in small quantities.

quality
rating

poor rating seems to have the following trends:
- fixed.acidity, higher volatile.acidity and lower citric.acid
- sulfur.dioxide and sulphates
- pH and high density

good rating seems to have the following trends:
- fixed.acidity and volatile.acidity
- sulfur.dioxide
- pH and higher density

ideal rating seems to have the following trends:
- fixed.acidity, lower volatile.acidity and higher citric.acid
- sulfur.dioxide and higher sulphates
- pH and density
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## 0.01373164 -0.12890656 -0.05065606
## total.sulfur.dioxide density pH
## -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol total.acidity
## 0.25139708 0.47616632 0.10375373
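Correlations like the ones above can be computed by correlating each numeric column with quality. The sketch below assumes Pearson correlation (the default of `cor`; the report does not say which method it used) and uses only three columns of the first ten wines from the str() output, so its numbers will differ from the full-data values above. The names `wines`, `predictors` and `cor_with_quality` are hypothetical.

```r
# First ten wines, three predictors only, copied from the str() output.
wines <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Every column except quality is treated as a predictor.
predictors <- setdiff(names(wines), "quality")

# Pearson correlation of each predictor with quality (assumed method).
cor_with_quality <- sapply(
  wines[predictors],
  function(column) cor(column, wines$quality)
)

print(round(cor_with_quality, 3))
```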
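The following variables show the strongest positive correlations with quality:
- alcohol
- sulphates
- citric.acid
- fixed.acidity

volatile.acidity shows the strongest negative correlation (about -0.39).

Let’s see some relationships between some variables and total.acidity.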
We see an approximately linear relationship between density and log10(total.acidity), and between pH and log10(total.acidity).
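A linear relationship of this kind can be checked by fitting a simple linear model. The sketch below fits pH against log10(total.acidity) on the first ten wines from the str() output, with total.acidity assumed to be fixed.acidity + volatile.acidity + citric.acid. The names `wines`, `fit` and `slope` are hypothetical, and a ten-row fit only illustrates the call, not the strength of the relationship in the full data.

```r
# First ten wines: pH from the str() output, total.acidity computed as the
# assumed sum fixed.acidity + volatile.acidity + citric.acid.
wines <- data.frame(
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  total.acidity = c(8.10, 8.68, 8.60, 12.04, 8.10, 8.06, 8.56, 7.95, 8.40, 8.36)
)

# Least-squares fit of pH on the log10 of total acidity.
fit <- lm(pH ~ log10(total.acidity), data = wines)

# The second coefficient is the slope for log10(total.acidity).
slope <- coef(fit)[[2]]
cat("slope:", slope, "\n")
```

A negative slope means that pH falls as total acidity rises, which is the direction expected from chemistry.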
I plotted the four variables that showed the highest correlations with quality (citric.acid, fixed.acidity, sulphates and alcohol) and faceted the plots by rating. I conclude that higher citric.acid and lower fixed.acidity yield better wines. Better wines also have higher alcohol and sulphates and lower pH.
As alcohol is the variable most strongly correlated with quality, it is worth looking at its pattern across rating levels. The above plot clearly shows that a higher % of alcohol tends to go with better wine.
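The same comparison can be made numerically by averaging alcohol within each rating level. The sketch below uses the first ten wines from the str() output, with rating written out by hand from the level codes shown there; the names `wines` and `mean_alcohol` are hypothetical. This small sample has no poor wines, so that level comes back as NA.

```r
# First ten wines: alcohol and rating, copied from the str() output.
wines <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  rating  = factor(
    c(rep("good", 7), "ideal", "ideal", "good"),
    levels  = c("poor", "good", "ideal"),
    ordered = TRUE
  )
)

# Mean alcohol content per rating level (NA for levels with no wines).
mean_alcohol <- tapply(wines$alcohol, wines$rating, mean)

print(mean_alcohol)
```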
Since more acidic wines seem to be better, it is worth seeing which acids have the most impact on wine quality. The above plot shows that fixed.acidity and citric.acid are positively correlated with quality, while volatile.acidity has a negative impact on quality.
The above plot shows a linear relationship between pH and total.acidity: the lower the pH, the more acidic the wine. This also supports the idea that more acidic wines tend to be better.
It would be great to see the real pattern between good and bad wines. The above plot differentiates between good and bad wines. It shows that a higher % of alcohol and higher sulphates give better wines.
After this EDA, I can conclude that the major factors for better wine quality are alcohol, acidity and sulphates. These features must be present in the required amounts, otherwise they will negatively affect the wine quality. Also, we can’t be totally sure about the quality index, as it was assigned by some experts. We’ve also concluded that there is a linear relationship between pH and quality with a negative slope, although the correlation is weak (about -0.06).
One thing that is still unclear is the amount of residual.sugar. It contains many outliers, and after removing the extreme ones we get a common range of 1 to 4. But we can’t find the amount that gives ideal wine quality. I think more research needs to be done to find its ideal quantity.